Project 3, APS1070 Fall 2024¶
PCA [14 marks]¶
Deadline: Nov 15th, 23:00
Academic Integrity
This project is individual - it is to be completed on your own. If you have questions, please post your query in the APS1070 Piazza Q&A forums (the answer might be useful to others!).
Do not share your code with others, or post your work online. Do not submit code that you have not written yourself without proper acknowledgment of the source, including generated code (please refer to the course syllabus). Students suspected of plagiarism on a project, midterm or exam will be referred to the department for formal discipline for breaches of the Student Code of Conduct.
Name:
Student ID:
Marking Scheme:¶
This project is worth 14 percent of your final grade.
Draw a plot or table where necessary to summarize your findings.
Practice Vectorized coding: If you need to write a loop in your solution, think about how you can implement the same functionality with vectorized operations. Try to avoid loops as much as possible (in some cases, loops are inevitable).
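To make the vectorization advice concrete, here is a small illustrative sketch (the toy array `X` is made up for this example): both versions standardize the columns of a matrix, but the second replaces the explicit Python loops with a single broadcast expression.

```python
import numpy as np

# Toy data: 5 samples x 3 features (made up for illustration)
X = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.],
              [2., 4., 6.],
              [3., 6., 9.]])

# Loop version: standardize each column one element at a time
X_loop = np.empty_like(X)
for j in range(X.shape[1]):
    mean_j = X[:, j].mean()
    std_j = X[:, j].std()
    for i in range(X.shape[0]):
        X_loop[i, j] = (X[i, j] - mean_j) / std_j

# Vectorized version: broadcasting does the same work in one expression
X_vec = (X - X.mean(axis=0)) / X.std(axis=0)

assert np.allclose(X_loop, X_vec)
```

The vectorized form is both shorter and much faster on large arrays, since the per-element work happens in compiled NumPy code rather than the Python interpreter.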
How to submit (HTML + IPYNB)¶
- Download your notebook: File -> Download .ipynb
- Click on the Files icon on the far left menu of Colab
- Select & upload the .ipynb file you just downloaded, then obtain its path (right click) (you might need to hit the Refresh button before your file shows up)
- Execute the following in a Colab cell:
%%shell
jupyter nbconvert --to html /PATH/TO/YOUR/NOTEBOOKFILE.ipynb
- An HTML version of your notebook will appear in the files, so you can download it.
- Submit both HTML and IPYNB files on Quercus for grading.
Part 1: Getting started with GitHub [1.5 Marks + 1.5 Mark Git Submission]¶
This first part of the project assignment is to be completed independently from Parts 2 - 5. In this part you will be completing some coding tasks and submitting your results on Github. To access this part of the assignment and upload your answers, you will need to use Github. Please complete the following step-by-step instructions:
- Create a Github account and install git for Windows or Mac.
- Create a personal access token using your Github account. Go to Settings >> Developer Settings >> Personal access tokens >> Tokens (classic) and generate a new token (also classic). When creating the token, make sure to fill in the Note section and select the repo scope (for repository access, like pushing) and workflow (required to modify workflow files). Make sure you copy the Personal Access Token as soon as it is generated.
- Open this link: https://classroom.github.com/a/oMeBXqsA to create your assignment repository in GitHub. You should get a link similar to:
https://github.com/APS1070-UofT/f24-project-3-part-1-*********
This is your private repository, from which you get this part's questions and to which you upload your answers. Copy this link to the text box below to be graded for this part.
### Add the link here ###
#https://github.com/APS1070-UofT/f24-project-3-part-1-XinleCui
- Open Git Bash, the app you downloaded in step 0, and set your email and username:
git config --global user.email "<your-GitHub-email>"
git config --global user.name "<your-GitHub-username>"
- Create a folder for the course on your computer and cd to it (cd means Change Directory). For example, on a Windows machine with the folder "C:\aps1070":
cd C:\aps1070
- Get your assignment by the link you got in step 2:
git clone https://github.com/APS1070-UofT/f24-project-3-part-1-*********
You will be asked to enter your Github username and password. Enter the username for your Github account into the Username field, and paste the personal access token which you copied in step 1 into the Password field. A new folder should be created in your directory, similar to:
C:\aps1070\f24-project-3-part-1-********
This folder contains an ipynb notebook, which you need to manually upload to Colab to answer its questions.
After you have finished working on this notebook, download it from Colab and move it to the directory from step 7.
Replace the old notebook with the new one that has your answers. Make sure your completed notebook has the same name as the original notebook you downloaded.
To submit your work, follow:
cd <your assignment folder>
git add F24_Project_3_Part_1_git.ipynb
git commit -m "Final Submission"
git push
If you have any problems pushing your work to GitHub, you can try one of the following commands:
git push --force
or
git push origin HEAD:main
- Make sure your submission is ready for grading. Open the private repository link in your browser and make sure you can see your final submission with your latest changes there. Only you and the teaching team can open that link.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
# Remove warnings
import warnings
warnings.filterwarnings('ignore')
### MOVE YOUR ANSWER TO PART 1 HERE ###
data_raw = pd.read_csv(
    filepath_or_buffer='https://raw.githubusercontent.com/Sabaae/Dataset/main/domestic_product.csv',
    index_col=0
)
print(data_raw.head())
1970 1971 1972 1973 \
Country
Afghanistan 1.731436e+09 1.812838e+09 1.647900e+09 1.702716e+09
Albania 2.265577e+09 2.330585e+09 2.397526e+09 2.466543e+09
Algeria 5.155121e+09 5.363676e+09 7.176428e+09 9.229622e+09
Andorra 9.947679e+07 1.131285e+08 1.435060e+08 1.908638e+08
Angola 3.806982e+09 4.007445e+09 4.102155e+09 5.016276e+09
1974 1975 1976 1977 \
Country
Afghanistan 2.061729e+09 2.319754e+09 2.513581e+09 2.864939e+09
Albania 2.537016e+09 2.610110e+09 2.685746e+09 2.760746e+09
Algeria 1.325976e+10 1.555555e+10 1.775003e+10 2.099094e+10
Andorra 2.360555e+08 2.785143e+08 2.875884e+08 3.213901e+08
Angola 5.626961e+09 4.147433e+09 3.980907e+09 4.344225e+09
1978 1979 ... 2012 2013 \
Country ...
Afghanistan 3.222964e+09 3.630768e+09 ... 1.913650e+10 1.962180e+10
Albania 2.842276e+09 2.372306e+09 ... 1.231978e+10 1.277628e+10
Algeria 2.637376e+10 3.320143e+10 ... 2.090000e+11 2.100000e+11
Andorra 3.897459e+08 5.207423e+08 ... 3.188830e+09 3.192603e+09
Angola 4.845482e+09 5.380155e+09 ... 1.280000e+11 1.370000e+11
2014 2015 2016 2017 \
Country
Afghanistan 1.955070e+10 1.871305e+10 1.801956e+10 1.889635e+10
Albania 1.322824e+10 1.138692e+10 1.186136e+10 1.301973e+10
Algeria 2.140000e+11 1.660000e+11 1.600000e+11 1.700000e+11
Andorra 3.267334e+09 2.788758e+09 2.894646e+09 3.000160e+09
Angola 1.460000e+11 1.160000e+11 1.010000e+11 1.220000e+11
2018 2019 2020 2021
Country
Afghanistan 1.841885e+10 1.890449e+10 2.014344e+10 1.493901e+10
Albania 1.515644e+10 1.540183e+10 1.513187e+10 1.826004e+10
Algeria 1.750000e+11 1.720000e+11 1.460000e+11 1.630000e+11
Andorra 3.218419e+09 3.155152e+09 2.890998e+09 3.324648e+09
Angola 1.010000e+11 8.313674e+10 5.482108e+10 7.053324e+10
[5 rows x 52 columns]
# Adjust the Dataframe, Transpose and Reset Index
data_raw = data_raw.T
data_raw = data_raw.reset_index()
data_raw = data_raw.rename(columns={'index': 'Year'})
data_raw.columns = data_raw.columns.str.strip()
data_raw.head()
| Country | Year | Afghanistan | Albania | Algeria | Andorra | Angola | Anguilla | Antigua and Barbuda | Argentina | Aruba | ... | Uganda | United Arab Emirates | United Kingdom | United States | Uruguay | Vanuatu | Venezuela (Bolivarian Republic of) | Viet Nam | Zambia | Zimbabwe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1970 | 1.731436e+09 | 2.265577e+09 | 5.155121e+09 | 99476787.0 | 3.806982e+09 | 3902380.0 | 33865823.0 | 3.398453e+10 | 197623955.0 | ... | 1.423837e+09 | 1.067267e+09 | 1.310000e+11 | 1.070000e+12 | 2.537645e+09 | 38009945.0 | 1.386364e+10 | 2.775103e+09 | 1.544192e+09 | 2.336518e+09 |
| 1 | 1971 | 1.812838e+09 | 2.330585e+09 | 5.363676e+09 | 113128492.0 | 4.007445e+09 | 4589878.0 | 41156057.0 | 4.065625e+10 | 208158924.0 | ... | 1.575309e+09 | 1.462298e+09 | 1.480000e+11 | 1.160000e+12 | 3.048552e+09 | 41066477.0 | 1.522727e+10 | 3.023992e+09 | 1.574165e+09 | 2.701725e+09 |
| 2 | 1972 | 1.647900e+09 | 2.397526e+09 | 7.176428e+09 | 143506004.0 | 4.102155e+09 | 5416747.0 | 56405489.0 | 3.831047e+10 | 219255492.0 | ... | 1.711634e+09 | 2.201610e+09 | 1.700000e+11 | 1.280000e+12 | 2.450067e+09 | 54568338.0 | 1.674419e+10 | 3.236929e+09 | 1.851647e+09 | 3.336472e+09 |
| 3 | 1973 | 1.702716e+09 | 2.466543e+09 | 9.229622e+09 | 190863792.0 | 5.016276e+09 | 6129013.0 | 78360329.0 | 4.314063e+10 | 230943598.0 | ... | 2.003225e+09 | 4.267683e+09 | 1.930000e+11 | 1.430000e+12 | 3.128677e+09 | 66567751.0 | 2.000000e+10 | 3.350312e+09 | 2.349834e+09 | 4.103781e+09 |
| 4 | 1974 | 2.061729e+09 | 2.537016e+09 | 1.325976e+10 | 236055468.0 | 5.626961e+09 | 6749690.0 | 83529500.0 | 4.853508e+10 | 243254775.0 | ... | 2.425947e+09 | 1.176896e+10 | 2.060000e+11 | 1.550000e+12 | 4.334171e+09 | 66380251.0 | 3.046512e+10 | 3.604731e+09 | 2.894548e+09 | 4.938099e+09 |
5 rows × 180 columns
# Define a function to plot the time series
def plot_time_series(df, country_list, scaler):
    for country in country_list:
        if country not in df.columns:
            print(f"Country '{country}' not found in the dataframe.")
            continue
        plt.figure(figsize=(12, 6))
        plt.plot(df['Year'], df[country], marker='o', linestyle='-')
        plt.title(f'Time Series for {country}')
        plt.xlabel('Year')
        if scaler:
            # Standardized data: plain y-axis
            plt.ylabel('Standard Scale GDP')
        else:
            plt.ylabel('GDP (Billions)')
            # Change the y-axis format to non-scientific notation
            plt.gca().get_yaxis().get_major_formatter().set_scientific(False)
            # Show the y-axis values in billions
            plt.gca().yaxis.set_major_formatter(mtick.FuncFormatter(lambda x, _: f'{x/1e9:.1f}B'))
        # Rotate the x-axis labels to avoid overlap
        plt.xticks(rotation=45, ha='right')
        plt.grid(True)
        plt.tight_layout()
        plt.show()
# Get the list of countries
selected_countries = ['Albania', 'Greece', 'Somalia', 'Sweden', 'Oman', 'Italy']
# Plot the Non-StandardScaler
print("Original Data Time Series:")
plot_time_series(data_raw, selected_countries, scaler=False)
Original Data Time Series:
# Standardize the data
scaler = StandardScaler()
# Create an empty list to store the scaled data
scaled_data = []
# Loop through the data by year
for year, year_data in data_raw.groupby('Year'):
    country_data = year_data.iloc[:, 2:]
    scaled_values = scaler.fit_transform(country_data.T).T
    scaled_year_data = year_data.copy()
    scaled_year_data.iloc[:, 2:] = scaled_values
    scaled_data.append(scaled_year_data)
# Concatenate all the scaled data
data_scaled = pd.concat(scaled_data).reset_index(drop=True)
# Plot the StandardScaler
print("StandardScaler Data Time Series:")
plot_time_series(data_scaled, selected_countries, scaler=True)
StandardScaler Data Time Series:
# Reset the dataset
data_raw = pd.read_csv(
    filepath_or_buffer='https://raw.githubusercontent.com/Sabaae/Dataset/main/domestic_product.csv',
    index_col=0
)
Part 2: Applying PCA [2 Marks]¶
- Compute the covariance matrix of the dataframe. Hint: The dimensions of your covariance matrix should be (52, 52). [0.25]
- Write a function get_sorted_eigen(df_cov) that gets the covariance matrix of dataframe df (from step 1), and returns sorted eigenvalues and eigenvectors using np.linalg.eigh. [0.25]
- Show the effectiveness of your principal components in covering the variance of the dataset with a scree plot. [0.25]
- How many PCs do you need to cover 99.9% of the dataset's variance? [0.25]
- Plot the first 16 principal components (eigenvectors) as a time series (16 subplots; on the x-axis you have dates and on the y-axis you have the value of the PC element). [0.5]
- Compare the first few PCs with the rest of them. Do you see any difference in their trend? [0.5]
The first few PCs (PC1 and PC2) appear to have smoother, more discernible trends than the later ones.
# Standardize the data
data_scaled_array = scaler.fit_transform(data_raw)
data_scaled = pd.DataFrame(data_scaled_array, columns=data_raw.columns, index=data_raw.index)
# Compute the covariance matrix
n, m = data_scaled.shape
# Compute covariance matrix
C = np.cov(data_scaled.T)
# Print the shape of the covariance matrix
print(C.shape)
(52, 52)
# Write a function that takes the covariance matrix of the dataframe and returns sorted eigenvalues and eigenvectors
def get_sorted_eigen(df_cov):
    # Get the eigenvalues and eigenvectors
    eigenValues, eigenVectors = np.linalg.eigh(df_cov)
    # Sort the eigenvalues in descending order
    args = (-eigenValues).argsort()
    # Reorder the eigenvalues and eigenvectors accordingly
    eigenValues = eigenValues[args]
    eigenVectors = eigenVectors[:, args]
    return eigenValues, eigenVectors
eigenValues, eigenVectors = get_sorted_eigen(C)
# Show the effectiveness of your principal components in covering the variance of the dataset with a `scree plot`
eigValSum = sum(eigenValues)
expVar = [(eigV / eigValSum) * 100 for eigV in eigenValues]
cumExpVar = np.cumsum(expVar)
# Plot the scree plot
plt.figure(figsize=(10, 5))
plt.bar(range(1, 53), expVar, color='lightblue', edgecolor='black', label='Explained Variance')
plt.plot(range(1, 53), cumExpVar, color='red', marker='o', linestyle='-', label='Cumulative Explained Variance')
plt.xlabel('Principal Component', fontsize=12)
plt.ylabel('Variance Explained (%)', fontsize=12)
plt.title('Explained and Cumulative Explained Variance', fontsize=14)
plt.grid()
plt.legend()
# Show the plot
plt.tight_layout()
plt.show()
# Find the number of principal components that can cover 99.9% of the dataset's variance
num_components = next(i + 1 for i, var in enumerate(cumExpVar) if var >= 99.9)
# Print the result
print(f'{num_components} Principal Components can cover {cumExpVar[num_components - 1]:.2f}% of the dataset\'s variance')
5 Principal Components can cover 99.91% of the dataset's variance
# Plot the first 16 principal components (Eigenvectors) as a time series
# Create an array of years from 1970 to 2021 (inclusive)
years = np.arange(1970, 2022)
# Plot the first 16 principal components
plt.figure(figsize=(25, 50))
for i in range(16):
    plt.subplot(8, 2, i+1)
    plt.plot(years, eigenVectors[:, i])
    plt.xticks(np.arange(1970, 2022), rotation=45)
    plt.title('PC' + str(i+1))
    plt.xlabel('Year')
    plt.ylabel('Value')
plt.tight_layout()
plt.show()
Part 3: Data reconstruction [4.5 Marks]¶
Create a function that:
Accepts a country and the original dataset as inputs.
Calls useful functions that you designed in previous parts to compute eigenvectors and eigenvalues.
Plots 4 figures:
- The original time-series for the specified country. [0.5]
- The incremental reconstruction of the original (not standardized) time-series for the specified country in a single plot. [1.5]
- You should at least show 5 curves in a figure for incremental reconstruction. For example, you can pick the following (or any other combination that you think is reasonable):
- Reconstruction with only PC1
- Reconstruction with both PC1 and PC2
- Reconstruction with PC1 to PC4 (First 4 PCs)
- Reconstruction with PC1 to PC8 (First 8 PCs)
- Reconstruction with PC1 to PC16 (First 16 PCs)
- Hint: you need to compute the reconstruction for the standardized time-series first, and then scale it back to the original (non-standardized) form using the StandardScaler inverse_transform method.
- The residual error for your best reconstruction with respect to the original time-series. [1]
- Hint: You are plotting the error that we have for reconstructing each year
(df - df_reconstructed). On the x-axis, you have dates, and on the y-axis, the residual error.
- The RMSE of the reconstruction as a function of the number of included components (x-axis is the number of components and y-axis is the RMSE). Sweep x-axis from 1 to 10 (this part is independent from part 3.2.) [1]
Test your function using the three countries you selected in Part 1.3, plus Greece, Somalia, and Italy, as inputs. [0.5]
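The reconstruction hint above can be sketched in a few lines. This is a minimal, self-contained illustration: the random matrix `data` and the component count `k` are stand-ins invented for this example, not the actual GDP dataframe.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random stand-in for the (n_countries, n_years) GDP matrix -- replace
# with the real dataframe's values in your own function.
rng = np.random.default_rng(0)
data = rng.normal(size=(20, 8))
k = 3  # hypothetical number of PCs to keep

scaler = StandardScaler()
data_std = scaler.fit_transform(data)

# Eigen-decompose the covariance matrix and sort components by
# descending eigenvalue
eigvals, eigvecs = np.linalg.eigh(np.cov(data_std.T))
order = (-eigvals).argsort()
W = eigvecs[:, order][:, :k]  # top-k principal directions

# Project one row (one "country") onto the PCs, reconstruct it in
# standardized space, then map it back to the original units
x_std = data_std[0:1]
x_rec_std = (x_std @ W) @ W.T
x_rec = scaler.inverse_transform(x_rec_std)
```

With all components kept, `W @ W.T` is the identity and the reconstruction is exact; truncating to the top k components gives the lossy approximation whose residual and RMSE Part 3 asks you to plot.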
# Combined function to plot all required figures
def plot_country_figures(original_df, country_name):
    # Standardize the entire dataset
    scaler = StandardScaler()
    original_scaled = scaler.fit_transform(original_df.values)
    # Compute the covariance matrix and eigenvectors
    C = np.cov(original_scaled.T)
    eigenValues, eigenVectors = get_sorted_eigen(C)
    # Extract the data for the specified country
    country_data = original_df.loc[country_name].values.reshape(1, -1)
    country_data_standard = scaler.transform(country_data)
    # Plot the required figures
    plt.figure(figsize=(18, 12))

    # 1. Original time-series for the country
    plt.subplot(2, 2, 1)
    original_df.loc[country_name, :].plot(title="Original Time-Series for " + country_name)
    plt.xlabel("Year")
    plt.ylabel("Value")
    plt.xticks(np.arange(0, len(original_df.columns), 5), rotation=45)
    plt.grid()

    # 2. Incremental reconstruction of the non-standardized time-series
    plt.subplot(2, 2, 2)
    num_components_list = [1, 2, 4, 8, 16]
    for num_components in num_components_list:
        W = eigenVectors[:, :num_components]
        country_projected = np.dot(country_data_standard, W)
        reconstructed_data_standard = np.dot(country_projected, W.T)
        reconstructed_data_original = scaler.inverse_transform(reconstructed_data_standard)
        plt.plot(original_df.columns, reconstructed_data_original.flatten(), label=f"PC{num_components}")
    plt.xlabel("Year")
    plt.ylabel("Value")
    plt.title("Incremental Reconstruction of the non-standardized time-series for " + country_name)
    plt.legend()
    plt.xticks(np.arange(0, len(original_df.columns), 5), rotation=45)
    plt.grid()

    # 3. Residual error for the best reconstruction (using 16 PCs)
    plt.subplot(2, 2, 3)
    best_num_components = min(16, original_df.shape[1])
    W_best = eigenVectors[:, :best_num_components]
    country_projected_best = np.dot(country_data_standard, W_best)
    best_reconstructed_data_standard = np.dot(country_projected_best, W_best.T)
    best_reconstructed_data_original = scaler.inverse_transform(best_reconstructed_data_standard)
    residual_error = country_data.flatten() - best_reconstructed_data_original.flatten()
    plt.plot(original_df.columns, residual_error, label=f"PC{best_num_components} Residual Error")
    plt.xlabel("Year")
    plt.ylabel("Error")
    plt.title("Residual Error for the Best Reconstruction for " + country_name)
    plt.legend()
    plt.xticks(np.arange(0, len(original_df.columns), 5), rotation=45)
    plt.grid()

    # 4. RMSE of the reconstruction as a function of the number of components
    plt.subplot(2, 2, 4)
    rmse_list = []
    components_range = range(1, min(11, original_df.shape[1] + 1))
    for num_components in components_range:
        W = eigenVectors[:, :num_components]
        country_projected = np.dot(country_data_standard, W)
        reconstructed_data_standard = np.dot(country_projected, W.T)
        reconstructed_data_original = scaler.inverse_transform(reconstructed_data_standard)
        rmse = np.sqrt(mean_squared_error(country_data, reconstructed_data_original))
        rmse_list.append(rmse)
    plt.plot(components_range, rmse_list, marker='o', label='RMSE (PC1-10)')
    plt.xlabel('Number of PCs')
    plt.ylabel('RMSE')
    plt.title("RMSE vs Number of Principal Components for " + country_name)
    plt.legend()
    plt.xticks(rotation=45)
    plt.grid()

    plt.tight_layout()
    plt.show()
country_list = [' Iceland ', ' Netherlands ', ' Liberia ', ' Greece ', ' Somalia ', ' Italy ']
for country in country_list:
    plot_country_figures(data_raw, country)
Part 4: SVD [2 Marks]¶
Modify your code in part 3 to use SVD instead of PCA. [1]
Explain if standardization or covariance computation is required for this part. Repeat part 3 and compare your PCA and SVD results. Write a function to make this comparison [0.5], and comment on the results. [0.5].
Standardization is not strictly required to perform SVD from a mathematical standpoint, and no covariance matrix is computed during the process. However, without standardization, the singular values associated with different features may differ significantly in magnitude, making SVD overly sensitive to features with large scales and potentially leading to suboptimal dimensionality reduction. Standardizing the data ensures that all features are given equal importance, thereby improving the effectiveness of SVD.
The identical RMSE for PCA and SVD in the plot arises from their mathematical equivalence in dimensionality reduction. Both methods use the same principal directions to reconstruct the data, leading to identical reconstruction errors. In PCA, we decompose the covariance matrix C = A^T A / (n - 1) to find eigenvectors (principal components) and eigenvalues, which capture the directions and variances in the data. In SVD, we decompose the standardized data matrix A directly into UΣV^T, where the right singular vectors V align with the eigenvectors of C, and the squared singular values (divided by n - 1) correspond to the eigenvalues. Thus, whether we use PCA's eigendecomposition or SVD, the projections and reconstructions are the same, yielding identical RMSE values.
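This equivalence is easy to check numerically. The sketch below uses a small random, column-centered matrix A as a stand-in for the standardized data: the squared singular values of A, scaled by 1/(n - 1), match the eigenvalues of its covariance matrix, and the right singular vectors match the eigenvectors up to sign.

```python
import numpy as np

# Random stand-in for a standardized data matrix (rows = samples)
rng = np.random.default_rng(42)
A = rng.normal(size=(30, 6))
A = A - A.mean(axis=0)  # center each column

# PCA route: eigen-decompose the covariance matrix, sort descending
eigvals, eigvecs = np.linalg.eigh(np.cov(A.T))
order = (-eigvals).argsort()
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# SVD route: decompose A directly
U, S, Vt = np.linalg.svd(A, full_matrices=False)

n = A.shape[0]
# Squared singular values scaled by 1/(n - 1) equal the eigenvalues
assert np.allclose(S**2 / (n - 1), eigvals)
# Right singular vectors equal the eigenvectors up to sign
for i in range(A.shape[1]):
    assert np.isclose(abs(eigvecs[:, i] @ Vt[i]), 1.0)
```

Because eigenvectors and singular vectors are each defined only up to a sign flip, the check compares the absolute value of their dot product rather than the vectors directly.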
# Combined function to plot all required figures using SVD
def plot_country_figures_svd(original_df, country_name):
    # Standardize only the numeric columns
    numeric_df = original_df.select_dtypes(include=['float64', 'int64'])
    scaler = StandardScaler()
    original_scaled = scaler.fit_transform(numeric_df.values)
    # Compute the left singular vectors, singular values, and right singular vectors
    U, S, V = np.linalg.svd(original_scaled, full_matrices=False)
    # Extract the data for the specified country
    country_data = original_df.loc[country_name].values.reshape(1, -1)
    country_data_standard = scaler.transform(country_data)
    # Plot the required figures
    plt.figure(figsize=(18, 12))

    # 1. Original time-series for the country
    plt.subplot(2, 2, 1)
    original_df.loc[country_name, :].plot(title="Original Time-Series for " + country_name)
    plt.xlabel("Year")
    plt.ylabel("Value")
    plt.xticks(np.arange(0, len(original_df.columns), 5), rotation=45)
    plt.grid()

    # 2. Incremental reconstruction
    plt.subplot(2, 2, 2)
    num_components_list = [1, 2, 4, 8, 16]
    for num_components in num_components_list:
        v = V[:num_components, :]
        country_projected = np.dot(country_data_standard, v.T)
        reconstructed_data_standard = np.dot(country_projected, v)
        reconstructed_data_original = scaler.inverse_transform(reconstructed_data_standard)
        plt.plot(original_df.columns, reconstructed_data_original.flatten(), label=f"SVD{num_components}")
    plt.xlabel("Year")
    plt.ylabel("Value")
    plt.title("Incremental Reconstruction for " + country_name)
    plt.legend()
    plt.xticks(np.arange(0, len(original_df.columns), 5), rotation=45)
    plt.grid()

    # 3. Residual error for the best reconstruction (using 16 components)
    plt.subplot(2, 2, 3)
    best_num_components = min(16, original_df.shape[1])
    v_best = V[:best_num_components, :]
    country_projected_best = np.dot(country_data_standard, v_best.T)
    best_reconstructed_data_standard = np.dot(country_projected_best, v_best)
    best_reconstructed_data_original = scaler.inverse_transform(best_reconstructed_data_standard)
    residual_error = country_data.flatten() - best_reconstructed_data_original.flatten()
    plt.plot(original_df.columns, residual_error, label=f"SVD{best_num_components} Residual Error")
    plt.xlabel("Year")
    plt.ylabel("Error")
    plt.title("Residual Error for the Best Reconstruction for " + country_name)
    plt.legend()
    plt.xticks(np.arange(0, len(original_df.columns), 5), rotation=45)
    plt.grid()

    # 4. RMSE of the reconstruction as a function of the number of components
    plt.subplot(2, 2, 4)
    rmse_list = []
    components_range = range(1, min(11, original_df.shape[1] + 1))
    for num_components in components_range:
        v = V[:num_components, :]
        country_projected = np.dot(country_data_standard, v.T)
        reconstructed_data_standard = np.dot(country_projected, v)
        reconstructed_data_original = scaler.inverse_transform(reconstructed_data_standard)
        rmse = np.sqrt(mean_squared_error(country_data, reconstructed_data_original))
        rmse_list.append(rmse)
    plt.plot(components_range, rmse_list, marker='o', label='RMSE (SVD1-10)')
    plt.xlabel('Number of Components')
    plt.ylabel('RMSE')
    plt.title("RMSE vs Number of Components for " + country_name)
    plt.legend()
    plt.xticks(rotation=45)
    plt.grid()

    plt.tight_layout()
    plt.show()
country_list = [' Iceland ', ' Netherlands ', ' Liberia ', ' Greece ', ' Somalia ', ' Italy ']
for country in country_list:
    plot_country_figures_svd(data_raw, country)
def pca_vs_svd(original_df, country_name):
    # Standardize the data
    scaler = StandardScaler()
    original_scaled = scaler.fit_transform(original_df.values)

    # Step 1: PCA analysis
    C = np.cov(original_scaled.T)
    eigenValues, eigenVectors = get_sorted_eigen(C)
    country_data = original_df.loc[country_name].values.reshape(1, -1)
    country_data_standard = scaler.transform(country_data)
    rmse_pca_list = []
    components_range = range(1, min(11, original_df.shape[1] + 1))
    for num_components in components_range:
        W = eigenVectors[:, :num_components]
        country_projected = np.dot(country_data_standard, W)
        reconstructed_data_standard = np.dot(country_projected, W.T)
        reconstructed_data_original = scaler.inverse_transform(reconstructed_data_standard)
        rmse_pca = np.sqrt(mean_squared_error(country_data, reconstructed_data_original))
        rmse_pca_list.append(rmse_pca)

    # Step 2: SVD analysis
    U, S, V = np.linalg.svd(original_scaled, full_matrices=False)
    rmse_svd_list = []
    for num_components in components_range:
        v = V[:num_components, :]
        country_projected = np.dot(country_data_standard, v.T)
        reconstructed_data_standard = np.dot(country_projected, v)
        reconstructed_data_original = scaler.inverse_transform(reconstructed_data_standard)
        rmse_svd = np.sqrt(mean_squared_error(country_data, reconstructed_data_original))
        rmse_svd_list.append(rmse_svd)

    # Step 3: Plot the RMSE comparison vs number of components
    plt.figure(figsize=(10, 6))
    plt.plot(components_range, rmse_pca_list, marker='o', label='PCA RMSE')
    plt.plot(components_range, rmse_svd_list, marker='x', label='SVD RMSE')
    plt.xlabel('Number of Components')
    plt.ylabel('RMSE')
    plt.title(f'RMSE Comparison for PCA vs SVD - {country_name}')
    plt.legend()
    plt.grid()
    plt.show()
country_list = [' Iceland ', ' Netherlands ', ' Liberia ', ' Greece ', ' Somalia ', ' Italy ']
for country in country_list:
    pca_vs_svd(data_raw, country)
Part 5: Let's collect another dataset! [2.5 Marks]¶
Create another dataset similar to the dataset in part 1, this time using the raw information on average monthly electricity sales across various states of the United States from 2001 to 2024 here. [1]
You need to manipulate the data to organize it in the desired format (i.e., the same format used in previous parts). Remove missing values such that if the average electricity sales of a particular state is missing at a given date, that date is removed from the dataset completely, even if data for that date exists for other states.
You are free to use any tools you like, from Excel to Python! In the end, you should have a new CSV file similar to the previous dataset. How many features does the final dataset have? How many cities are there?
The final dataset has 62 rows (states) and 277 features (monthly dates).
Upload your new dataset (in CSV format) to your colab notebook, repeat part 4 for this dataset [1], and comment on the results [0.5]. When analyzing the states, use New York, Utah, and three other states with the closest alphabetical names to your first name.
Looking at the time series analysis and dimensionality reduction comparison across five states (New York, Utah, Indiana, New Jersey, and Louisiana), several key patterns emerge. In the initial analysis, each state demonstrates significant improvement in reconstruction accuracy as SVD components increase from 1 to 16, with most of the improvement occurring in the first 4 components. The states show varying patterns: Utah displays a distinctive upward trend while the others exhibit more cyclical patterns. The residual error plots with SVD16 reveal mostly random errors centered around zero, indicating effective signal capture. When comparing PCA and SVD methods, the results show identical RMSE values for both techniques. Both methods demonstrate substantial error reduction as components increase from 1 to 10, with the most significant improvements occurring within the first 2-4 components. This suggests that both PCA and SVD are equally effective for dimensionality reduction in this context, with the early components capturing the most important patterns in the data.
The code below helps you to upload your new CSV file to your colab session.
# get url
url = 'https://raw.githubusercontent.com/Sabaae/Dataset/main/electricity_prices.csv'
df = pd.read_csv(url)
df = df.drop(['Unnamed: 0', 'month', 'price', 'revenue'], axis=1)
df = df[df['sectorName'] == 'all sectors']
df.drop(['sectorName'], axis=1, inplace=True)
# Step 2: Pivot the DataFrame to create a time series like structure
df_pivot = df.pivot(index='stateDescription', columns='date', values='sales')
# Remove the states with missing values
df_pivot.dropna(inplace=True)
# Display the reshaped DataFrame
print(df_pivot.head())
# To csv
df_pivot.to_csv('electricity_sales.csv')
date              2001-01-01   2001-02-01   2001-03-01   2001-04-01  \
stateDescription
Alabama           7362.47302   6041.02574   5894.61038   6064.53539
Alaska             521.03566    486.09655    443.79575    442.19678
Arizona           4786.79176   4416.00228   4362.10349   4184.61427
Arkansas          3804.21013   3256.86530   3031.50981   2973.53436
California       21744.31668  18305.55916  21351.40877  18855.59986

date              2001-05-01   2001-06-01   2001-07-01   2001-08-01  \
stateDescription
Alabama           6413.96530   7169.37956   7794.07207   7906.56345
Alaska             415.90066    414.23448    421.09334    422.70564
Arizona           5131.87593   5816.22924   6475.18991   6529.11864
Arkansas          3076.88704   3550.33976   4176.61212   4428.11278
California       19862.25747  21194.24204  21538.59260  23772.48986

date              2001-09-01   2001-10-01  ...   2023-04-01   2023-05-01  \
stateDescription                           ...
Alabama           7102.06398   5983.28302  ...   6188.47785   6722.44477
Alaska             429.95111    437.61982  ...    491.61997    469.50733
Arizona           6130.78171   5346.71813  ...   5889.67115   7001.27492
Arkansas          4072.36780   3347.34134  ...   3426.56209   3665.63599
California       20600.89551  21599.95702  ...  16154.09389  18182.38622

date              2023-06-01   2023-07-01   2023-08-01   2023-09-01  \
stateDescription
Alabama           7614.18498   8609.33791   8944.48439   7856.24154
Alaska             447.40639    470.75186    477.70227    455.98475
Arizona           7727.29244  10673.30122  10121.75565   8085.44509
Arkansas          4232.23630   4852.03809   5216.01755   4718.83341
California       19223.82354  22940.58818  25018.24230  22347.58343

date              2023-10-01   2023-11-01   2023-12-01   2024-01-01
stateDescription
Alabama           6516.62250   6391.99899   6997.08197   8019.81521
Alaska             488.87172    506.72640    565.94126    588.20807
Arizona           6987.36561   5615.36550   5938.68825   6326.08483
Arkansas          3835.56799   3651.22771   3939.32635   4598.63147
California       20282.41442  18378.73369  18560.06591  20044.01987

[5 rows x 277 columns]
df_pivot.shape
(62, 277)
# Upload your new CSV file to Google Colab
# from google.colab import files
# uploaded = files.upload()
### YOUR CODE HERE ###
state_list = ['New York', 'Utah', 'Indiana','New Jersey','Louisiana']
for state in state_list:
    plot_country_figures_svd(df_pivot, state)
state_list = ['New York', 'Utah', 'Indiana', 'New Jersey', 'Louisiana']
for state in state_list:
    pca_vs_svd(df_pivot, state)
References¶
Understanding PCA and SVD:
https://towardsdatascience.com/pca-and-svd-explained-with-numpy-5d13b0d2a4d8
https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.8-Singular-Value-Decomposition/
PCA:
U.S. Electricity Prices Data: